Lecture 3: Reading Data

Bogdan G. Popescu

John Cabot University

The Working Directory

The R environment always points to a certain directory on our computer, which is known as the working directory.

We can get the current working directory with getwd

Type that in a new chunk. What is your current working directory?

getwd()

[1] "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3"

The Working Directory on MacOS

If your working directory is not week3, create a new folder called “week3”

Then set the directory using setwd()

To find the week3 folder’s path on macOS, select the folder and click “Get Info”


Select what is in the “Where” section and copy it using “Cmd + C”

The Working Directory on MacOS

In this case, the working directory will be:

[1] "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data"

We now simply need to add week3 to this path

Our final path will be:

[1] "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3")

The Working Directory on Windows

To do the same on Windows, right-click the week3 folder and choose “Properties”


The Working Directory on Windows

In this case, the working directory will be:

\\Mac\Dropbox-1\john_cabot\teaching\big_data

We first need to change all the backslashes to forward slashes:

[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data"

We now simply need to add week3 to this path

Our final path will be:

[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("/Mac/Dropbox-1/john_cabot/teaching/big_data/week3")

The Working Directory on Windows

In your case, this will look something like:

setwd("C:/Dropbox/john_cabot/teaching/big_data/week3")
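Since a mistyped path is a common source of errors, it can help to check that a folder exists before pointing R at it. Here is a minimal sketch that uses a throwaway folder under tempdir() rather than a real Dropbox path; substitute your own folder in practice.

```r
# A defensive sketch: create a throwaway "week3" folder under the session's
# temporary directory, confirm it exists, and only then make it the working
# directory. Replace the tempdir() path with your own folder in practice.
path <- file.path(tempdir(), "week3")
dir.create(path, showWarnings = FALSE)
if (dir.exists(path)) {
  setwd(path)
}
basename(getwd())
```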

Importing Data into R

The easiest datasets to work with are CSV files

R is also capable of reading a variety of other dataset formats, including .dta, .sas, .xlsx, .xls, .txt, etc.

The library used to read these files affects how quickly your computer can read the data

Let us look at some examples

  • read_csv
  • read.csv

Importing Data into R

  • read_csv
library("readr")
#This is to set the directory
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3/")
# Reading data
data_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")

#Recording how long it takes

#Step 1: Recording your system's time
start_time <- Sys.time()
#Step 2: Loading the data
data_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")
#Step 3: Recording when it finishes
end_time <- Sys.time()
#Step 4: Calculating the difference
time_taken_a <- end_time - start_time
#Step 5: Printing the difference
time_taken_a
Time difference of 0.1398301 secs

Importing Data into R

  • read.csv
# Reading data
#Step 1: Recording your system's time
start_time <- Sys.time()
#Step 2: Loading the data
data_1851_obs10000_b<-read.csv("./data/data_1851_obs10000.csv")
#Step 3: Recording when it finishes
end_time <- Sys.time()
#Step 4: Calculating the difference
time_taken_b <- end_time - start_time
#Step 5: Printing the difference
time_taken_b
Time difference of 0.2031388 secs

Importing Data into R

  • read.csv
  • read_csv
  • Whether you use read.csv or read_csv makes a difference
  • the difference here is: time_taken_b - time_taken_a ≈ 0.0633 seconds
  • This is hugely important for large datasets
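The same comparison can be written more compactly with base R's system.time(), which records elapsed seconds in one call. The sketch below is self-contained: it writes a throwaway CSV first, so the numbers will differ from the slides and across machines.

```r
library(readr)
# Write a small throwaway CSV, then time base read.csv() against readr's
# read_csv(). Only the relative comparison matters, not the absolute numbers.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:100000, y = rnorm(100000)), tmp, row.names = FALSE)
t_base  <- system.time(read.csv(tmp))["elapsed"]
t_readr <- system.time(read_csv(tmp, show_col_types = FALSE))["elapsed"]
c(base = unname(t_base), readr = unname(t_readr))
```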

Alternative File Formats

  • CSV
  • xlsx and xls
  • SPSS
  • Stata

Examples of Reading Different Files

#Reading CSV files
csv_file<-read.csv("./data/data_1851_obs10000.csv")
#Library for reading Stata files
library("haven")
#Library for reading SPSS files
library("foreign")
#Reading a Stata file
stata_file<-read_dta("./data/data_1851_obs10000.dta")
#Reading an SPSS file
spss_file<-read.spss("./data/data_1851_obs10000.sav", to.data.frame=TRUE)
#Library for reading Excel files
library("readxl")
#Reading an Excel file
excel_file<-read_excel("./data/data_1851_obs10000.xlsx")

Writing CSV files

You can easily write CSV files to your hard drive using the readr library

#load the tidyverse readr package
library(readr)
#writing data as csv
write_csv(excel_file, "./data/data_1851_obs10000_written.csv")
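A quick way to convince yourself that the write worked is a round trip through a temporary file; the tiny data frame below is made up for illustration.

```r
library(readr)
# Write a tiny data frame to a temporary CSV, read it back, and inspect it.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp)
back <- read_csv(tmp, show_col_types = FALSE)
back
```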

Tidy Data

  • Tidy Data = “standard way of mapping the meaning of a dataset to its structure.” (Hadley Wickham)
  • Within tidy data:
    • each variable forms a column
    • each observation forms a row
    • each cell is a single measurement
  • Tidy data means that all datasets are alike
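As a concrete illustration, here is a small made-up dataset (the population figures are purely illustrative) that satisfies all three rules:

```r
library(tibble)
# Each row is one observation (a city in a year), each column one variable,
# and each cell holds a single value.
tidy_example <- tibble(
  city       = c("Rome", "Rome", "Milan", "Milan"),
  year       = c(2020, 2021, 2020, 2021),
  population = c(2873000, 2860000, 1396000, 1406000)
)
tidy_example
```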

Tidy Data

Tidy datasets are all alike

Untidy Data

Every messy dataset is messy in its own way.


Tidy Data

You can make untidy data tidy by following one of these strategies:

  • Pivoting
    • Longer
    • Wider
  • Separating
  • Uniting

Untidy Data 1: Pivoting Longer

  • A common problem is that column names are not names of variables, but values of a variable
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
  • We have columns whose names are values, not variables: 1999 and 2000
  • We need a new column that holds these two values. Let’s call it year
  • We also need another column that reflects the numbers. Let’s call it cases


This is what that looks like in code

Original

table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

Fix

table4a %>% 
  pivot_longer(c(`1999`, `2000`), 
               names_to = "year",
               values_to = "cases")
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

Untidy Data 2: Pivoting Wider

  • This is where one observation is scattered across multiple rows.
  • Within the table below, an observation is a country in a year, but each observation is spread across two rows.
table2
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows


This is what that looks like in code

Original

table2
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

Fix

table2 %>% 
  pivot_wider(names_from = type, 
              values_from = count)
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Untidy Data 3: Separating

  • We sometimes have columns that contain data which should be separated into multiple columns

  • This is where we use separate()

table3
#> # A tibble: 6 × 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583


This is what that looks like in code

Original

#> # A tibble: 6 × 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

Fix

table3 %>% 
  separate(rate, into = c("cases", "population"))
#> # A tibble: 6 × 4
#>   country      year cases  population
#>   <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
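Note that separate() returns character columns by default (the <chr> tags above). Passing convert = TRUE asks it to re-parse the pieces as numbers; since table3 ships with the tidyr package, the sketch below runs as-is.

```r
library(tidyr)
# separate() with convert = TRUE re-parses the new columns, so cases and
# population come back as numbers rather than character strings.
out <- separate(table3, rate, into = c("cases", "population"), convert = TRUE)
class(out$cases)
```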

Untidy Data 4: Uniting

  • Uniting is the inverse of separating

  • It combines multiple columns into a single column

table5
#> # A tibble: 6 × 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583


This is what that looks like in code

Original

table5
#> # A tibble: 6 × 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

Fix

table5 %>% 
  unite(new, century, year, sep = "")
#> # A tibble: 6 × 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

Missing Values 1

A value can be missing explicitly (flagged with NA) or implicitly (simply absent from the data). Here is an example:

library("tibble")
library("tidyr")
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
  • The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA

  • The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

Missing Values 1

To deal with them, we can use values_drop_na = TRUE

library("tibble")
library("tidyr")

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
# A tibble: 6 × 3
    qtr year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     2 2015    0.59
3     2 2016    0.92
4     3 2015    0.35
5     3 2016    0.17
6     4 2016    2.66
  • Notice the use of values_drop_na = TRUE

  • Notice that the 4th quarter of 2015 is missing now

Missing Values 2

We can also use complete()

stocks %>% 
  complete(year, qtr)
# A tibble: 8 × 3
   year   qtr return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2015     3   0.35
4  2015     4  NA   
5  2016     1  NA   
6  2016     2   0.92
7  2016     3   0.17
8  2016     4   2.66
  • complete() takes a set of columns, and finds all unique combinations

  • It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

Missing Values 3

  • Sometimes the NA is not random

  • Missing values could indicate that the previous value should be carried forward:

Example:

library("dplyr")
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

Missing Values 3

We can deal with this problem by using fill()

Original

library("dplyr")
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

Fix

treatment %>% 
  fill(person)
# A tibble: 4 × 3
  person           treatment response
  <chr>                <dbl>    <dbl>
1 Derrick Whitmore         1        7
2 Derrick Whitmore         2       10
3 Derrick Whitmore         3        9
4 Katherine Burke          1        4

Relational Data

  • In the real world, people typically have to work with many tables

  • Multiple tables of data that need to be combined are called relational data

Verbs for Relational Data

  • Mutating joins
    • add new variables to one data frame from matching observations in another
  • Filtering joins
    • filter observations from one data frame based on whether or not they match an observation in the other table
  • Set operations
    • treat observations as if they were set elements
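The difference between the first two families can be seen on a toy example; x and y below are made-up tables sharing a key column.

```r
library(dplyr)
library(tibble)
# Two made-up tables: keys 1,2,3 in x and keys 1,2,4 in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))
left_join(x, y, by = "key")  # mutating: adds val_y to x (NA where no match)
semi_join(x, y, by = "key")  # filtering: keeps x rows that have a match in y
anti_join(x, y, by = "key")  # filtering: keeps x rows with no match in y
```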

Examples: NYC Flights

We will first load two libraries

library(tidyverse)
library(nycflights13)

There are five databases (tibbles) inside nycflights13:

  • airlines
  • airports
  • planes
  • weather
  • flights

The airlines tibble

airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

The airports tibble

airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

The weather tibble

weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

The planes tibble

planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

The flights tibble

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The tibbles together

This is how the five tibbles could be connected to one another

  • flights connects to planes via a single variable, tailnum

  • flights connects to airlines through the carrier variable

  • flights connects to airports through the origin and dest variables

  • flights connects to weather via origin (the location), and year, month, day, and hour (the time)

Keys

  • The variables used to connect the tibbles are called keys; keys uniquely identify the observations

  • Sometimes one variable uniquely identifies the observation - tailnum for planes

  • Other times, more than one variable is necessary - year, month, day, hour, and origin.

Keys

  • It is always good practice to make sure that the keys identified are in fact unique

  • This can be done this way for planes:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
  • This means that tailnum uniquely identifies planes

Keys

  • We can try to uniquely identify observations for flights in the following way:
flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)
# A tibble: 29,768 × 5
    year month   day flight     n
   <int> <int> <int>  <int> <int>
 1  2013     1     1      1     2
 2  2013     1     1      3     2
 3  2013     1     1      4     2
 4  2013     1     1     11     3
 5  2013     1     1     15     2
 6  2013     1     1     21     2
 7  2013     1     1     27     4
 8  2013     1     1     31     2
 9  2013     1     1     32     2
10  2013     1     1     35     2
# ℹ 29,758 more rows
  • For flights, it seems that the combination of year, month, day, and flight is not enough to uniquely identify observations

Keys

  • We can use surrogate keys to uniquely identify observations
flights %>% 
  mutate(unique_obs = row_number()) %>%
  count(unique_obs)%>%
  filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: unique_obs <int>, n <int>

Types of Joins

  • A left join keeps all the observations in x.
  • A right join keeps all the observations in y
  • A full join keeps all the observations in both x and y
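On a made-up pair of tables (keys 1, 2, 3 in x and keys 1, 2, 4 in y), the row counts make the difference concrete:

```r
library(dplyr)
library(tibble)
# Two toy tables whose keys only partially overlap.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))
nrow(left_join(x, y, by = "key"))   # 3: all rows of x are kept
nrow(right_join(x, y, by = "key"))  # 3: all rows of y are kept
nrow(full_join(x, y, by = "key"))   # 4: keys 1, 2, 3 and 4 all appear
```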

Left Join

Right Join

Full Join

Advice

  • The most common type of join that you will need in practice is a left join

  • This allows you to preserve the original observations even when there is no match

Caveat: Duplicate Keys

  • It is important to consider what happens when we have duplicate keys

  • Case 1: One Table has duplicated keys


  • Case 2: Both Tables have duplicated keys

  • When you join duplicated keys, you get all possible combinations, the Cartesian product
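A minimal sketch of Case 2 with made-up tables (recent dplyr versions print a many-to-many warning here, which is exactly the situation being flagged):

```r
library(dplyr)
library(tibble)
# Both x rows share key 1 and both y rows share key 1, so the join returns
# every combination of matches: 2 x 2 = 4 rows (the Cartesian product).
x <- tibble(key = c(1, 1), val_x = c("x1", "x2"))
y <- tibble(key = c(1, 1), val_y = c("y1", "y2"))
joined <- left_join(x, y, by = "key")
nrow(joined)
```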

Example of a Left Join

Left Join

  • flights connects to planes via a single variable, tailnum

Left Join

Step 1: Selecting only relevant variables

library(tidyverse)
library(nycflights13)
flights2 <- flights %>% 
  select(day, hour, origin, dest, tailnum, carrier)

Left Join

Step 2: Performing the Left Join

flights_planes<-left_join(flights2, planes, by = c("tailnum" = "tailnum"))
flights_planes
# A tibble: 336,776 × 14
     day  hour origin dest  tailnum carrier  year type        manufacturer model
   <int> <dbl> <chr>  <chr> <chr>   <chr>   <int> <chr>       <chr>        <chr>
 1     1     5 EWR    IAH   N14228  UA       1999 Fixed wing… BOEING       737-…
 2     1     5 LGA    IAH   N24211  UA       1998 Fixed wing… BOEING       737-…
 3     1     5 JFK    MIA   N619AA  AA       1990 Fixed wing… BOEING       757-…
 4     1     5 JFK    BQN   N804JB  B6       2012 Fixed wing… AIRBUS       A320…
 5     1     6 LGA    ATL   N668DN  DL       1991 Fixed wing… BOEING       757-…
 6     1     5 EWR    ORD   N39463  UA       2012 Fixed wing… BOEING       737-…
 7     1     6 EWR    FLL   N516JB  B6       2000 Fixed wing… AIRBUS INDU… A320…
 8     1     6 LGA    IAD   N829AS  EV       1998 Fixed wing… CANADAIR     CL-6…
 9     1     6 JFK    MCO   N593JB  B6       2004 Fixed wing… AIRBUS       A320…
10     1     6 LGA    ORD   N3ALAA  AA         NA <NA>        <NA>         <NA> 
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>

Left Join by Multiple Keys

  • In the previous example, we joined by tailnum

  • But we can also join by multiple variables

flights3 <-left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
flights3
# A tibble: 336,776 × 29
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

Filtering Joins

  • Filtering joins match observations in the same way as mutating joins

  • They affect observations, not variables

    • semi_join(x, y) keeps all observations in x that have a match in y
    • anti_join(x, y) drops all observations in x that have a match in y

The Semi-Join

  • Only the existence of a match is important

  • It doesn’t matter which observation is matched

The Anti-Join

  • It is the inverse of the semi-join

  • The anti-join keeps the rows that don’t have a match

  • They are useful for diagnosing join mismatches
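For instance, an anti-join quickly shows which flights lack a match in planes, using the nycflights13 tables loaded earlier:

```r
library(dplyr)
library(nycflights13)
# Which flights have a tailnum that never appears in the planes table?
missing_planes <- flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
missing_planes
```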

Dates and Times

When working with spatial data, it is often necessary to work with dates and time series

In other words, spatial data may contain temporal information

Dates and Times

Some commonly used classes to represent time in R include:

Times:

  • Date - to represent data at a day-by-day level
  • POSIXct and POSIXlt - to represent data at a second-by-second level

Time Series:

  • ts
  • zoo (with the zoo package)
  • xts (with the xts package)
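As a minimal illustration of the base ts class (the values below are made up), here is a quarterly series starting in 2015:

```r
# ts() stores a regular time series: here four quarterly values for 2015.
# The NA shows that a period can be present but unobserved.
s <- ts(c(1.88, 0.59, 0.35, NA), start = c(2015, 1), frequency = 4)
s
frequency(s)
```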

Working with Date objects

The simplest data structure to represent time is Date

For example, we can turn an object from character to Date by using the as.Date() function

x = "2014-10-20"
x
[1] "2014-10-20"
class(x)
[1] "character"
x2 = as.Date(x)
x2
[1] "2014-10-20"
class(x2)
[1] "Date"

Working with Date objects

The character values are in the standard format known as ISO 8601 (YYYY-MM-DD)

The as.Date function works without additional arguments

When the character values are in a non-standard format, we need to specify the format definition

Working with Date objects

The following table lists the most commonly used symbols for specifying dates in R.

Symbol  Example    Meaning
%d      "15"       Day
%m      "08"       Month, numeric
%b      "Aug"      Month, 3-letter
%B      "August"   Month, full
%y      "14"       Year, 2-digit
%Y      "2014"     Year, 4-digit
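The same symbols also work in the other direction with format(), turning a Date back into text:

```r
# format() uses the same conversion symbols to print a Date as text.
d <- as.Date("2014-08-15")
format(d, "%d/%m/%Y")
```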

Working with Date objects: Examples

For example, we need to specify the format for certain date strings.

as.Date("07/Aug/12")
## Error in charToDate(x) : 
##  character string is not in a standard unambiguous format

To fix the error, we need to type:

as.Date("07/Aug/12", format = "%d/%b/%y")
[1] "2012-08-07"

Working with Date objects: Examples

For example, we need to specify the format for certain date strings.

as.Date("2012-August-07")
## Error in charToDate(x) : 
##  character string is not in a standard unambiguous format

To fix the error, we need to type:

as.Date("2012-August-07", format = "%Y-%B-%d")
[1] "2012-08-07"

Obtaining “Parts” of Date

format(as.Date("1955-11-30"), format = "%m/%Y")
[1] "11/1955"
format(as.Date("1955-11-30"), format = "%d")
[1] "30"

The following will not work

format("1955-11-30", format = "%m/%Y")
[1] "1955-11-30"
format("1955-11-30", format = "%d")
[1] "1955-11-30"

This is because the first two are Date objects while the last two are character objects

x1<-"1955-11-30"
class(x1)
[1] "character"
x2<-as.Date("1955-11-30")
class(x2)
[1] "Date"

Arithmetic operations with dates

Date objects act like numeric objects

We can thus perform:

  • conditional operations
  • addition and subtraction of/between dates
  • creating date sequences

Arithmetic operations with dates: subtraction

x2<-as.Date("1955-11-30")
x3<-as.Date("1955-11-01")
time_diff1<-x2-x3
time_diff1
Time difference of 29 days

Another example:

x2<-as.Date("1955-11-30")
x4<-as.Date("2023-12-20")
time_diff2<-x4-x2
time_diff2
Time difference of 24857 days

Arithmetic operations with dates: subtraction

If we look at time_diff1 and time_diff2, they are both of class difftime

class(time_diff1)
[1] "difftime"
class(time_diff2)
[1] "difftime"

We can convert them to numeric values in specific units.

as.numeric(time_diff1, unit = "days")
[1] 29
as.numeric(time_diff2, unit = "weeks")
[1] 3551

Arithmetic operations with dates: sequences

We can also create a sequence of dates using seq.

For example, the following expression creates a weekly sequence of dates.

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
x
 [1] "2018-10-14" "2018-10-21" "2018-10-28" "2018-11-04" "2018-11-11"
 [6] "2018-11-18" "2018-11-25" "2018-12-02" "2018-12-09" "2018-12-16"
[11] "2018-12-23" "2018-12-30" "2019-01-06"

Time Series: Subsetting

This is how we can subset our date vector.

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
x[x<"2019-01-06"]
 [1] "2018-10-14" "2018-10-21" "2018-10-28" "2018-11-04" "2018-11-11"
 [6] "2018-11-18" "2018-11-25" "2018-12-02" "2018-12-09" "2018-12-16"
[11] "2018-12-23" "2018-12-30"

Time Series: Consecutive Differences

The diff function can be used to create a vector of differences between consecutive elements

For example:

x = c(15, 8, 23, 24)
diff(x)
[1] -7 15  1

or

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
diff(x)
Time differences in days
 [1] 7 7 7 7 7 7 7 7 7 7 7 7

Time Series: Consecutive Differences

Note that diff(x) is one element shorter than x

We don’t have the difference for the first (or last) element, depending on how you look at it

length(x)
[1] 13
length(diff(x))
[1] 12